Hi everyone! For my first Medium posts I wanted to do something fun and interesting. In this post I want to give a real-world example of the power of a programming language (R in this case, though Python and many other languages would work just as well for this type of analysis) combined with open data sources to gather evidence and make data-driven decisions!

The purpose of this series of posts is to look at COVID-19 mortality and vaccination data per country, check for trends, and see whether we can draw any major conclusions about the efficacy of the vaccines. The posts are divided into:

  • Data preparation and exploration: This post presents the data sources and some figures to better understand them. All the code is available on GitHub: loading the data of interest, cleaning it, merging it into a common data frame and creating exploratory figures.
  • Statistical model: With all the data collected and ready for analysis, we can fit a statistical model to test the effectiveness of vaccines in reducing the mortality rate of COVID-19. This exercise is only for learning and for showing how an ecological model works, because we already know from multiple scientific studies conducted at the individual level that COVID-19 vaccines have a high efficacy in preventing COVID-19 deaths (see for example Polack et al., 2020).

The purpose of this series of posts is to show how we can use data to create meaningful figures and statistical models that provide solid, fact-based evidence for decision making. All the analysis and figures were done in R, so if you want to explore how everything in this post was created, please visit my GitHub page with all the code.

That being said, let’s see what the world data of COVID-19 and vaccination status has to offer us!

1 Introduction of the problem

At the present moment (August 2022), the ongoing COVID-19 pandemic has taken more than 6 million human lives worldwide since it started in 2020. Novel vaccines were developed at the end of 2020 to reduce the mortality associated with this respiratory infectious disease.

Countries have adopted vaccines at different speeds and to different extents, which generates diverse data that could help us answer whether higher vaccination rates are associated with lower COVID-19 case fatality rates (CFR). In the next sections I will define CFR more precisely, but for now let’s define it simply as total deaths divided by total cases of COVID-19. The CFR gives us an estimate of the percentage of people who die once they have the disease. Obviously, a lower CFR is desirable.

As mentioned before, multiple independent scientific studies have already shown that vaccines do reduce the CFR of COVID-19 (see here). This is just an academic exercise to play with some data and statistical models on a problem for which we already know the answer.

2 Available Data

2.1 COVID-19 Cases and Deaths

Let’s start with the data on COVID-19 cases and deaths per country. As an international organization with the mission to improve human health, the World Health Organization (WHO) has been compiling daily country-level data on the COVID-19 pandemic. This data is easily accessible to the general public (link to download). It contains daily COVID-19 cases and deaths for a wide range of countries.

Here are the main characteristics of this data set:

  • Temporal resolution: Daily, from 2020 (January) to 2022 (present)
  • Geographical resolution: Country level (237 different countries) and world region (7 in total)
  • Main outcome to study: Case Fatality Rate (CFR), the proportion of confirmed COVID-19 cases that end in death (\(\mathrm{CFR}=\frac{\text{Deaths}}{\text{Cases}}\))
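
As a quick worked example of the CFR formula, here is a minimal Python sketch (not the R code used for this post). The function name and the numbers are hypothetical and only illustrate the arithmetic, with the result expressed as a percentage.

```python
def case_fatality_rate(deaths: int, cases: int) -> float:
    """Return the CFR as a percentage: 100 * deaths / cases."""
    if cases <= 0:
        raise ValueError("CFR is undefined when there are no confirmed cases")
    return 100.0 * deaths / cases


# Hypothetical month for one country: 5,000 confirmed cases and 90 deaths.
example_cfr = case_fatality_rate(deaths=90, cases=5000)
print(f"CFR = {example_cfr:.2f}%")  # CFR = 1.80%
```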

2.2 Vaccination data

As we are interested in exploring the effect of vaccines on the case fatality rate for COVID-19, we also need vaccination data per country and month. Our World in Data is a data organization that compiles country-level data from multiple sources to make research easier. They have made a strong effort to compile a database with the total number of people fully vaccinated (people who received all doses prescribed by the vaccination protocol) by date and country. The vaccination data is available here, with a public GitHub. Here are the main characteristics of the data set:

  • Temporal resolution: Daily, from 2020 (December) to 2022 (present)
  • Geographical resolution: Country (235 different countries)
  • Explanatory variable: People fully vaccinated per 100 inhabitants (that is, the percentage of people fully vaccinated)

This vaccination metric was chosen mainly because it is already normalized by the total population of each country, so it is comparable regardless of country size. One caveat is that it only counts people who have completed the full dosage required for the vaccine administered, whether that is 1 or 2 doses. In this post I will not consider the vaccine manufacturer or version (Sinovac, Pfizer, Moderna, among others), but it is a factor that could be considered in future posts.

3 Data preparation

With the introductions done, it is time to start working with the data. First we need to load both data sets and merge them based on the country name. This is the first difficulty, as the country names sometimes differ in spelling, or sometimes in geographical delimitation (some countries exist in the WHO data set but are not present in Our World in Data, or vice versa).

I decided to aggregate all mortality and vaccination data into monthly units. I believe a month is a good aggregation unit to represent the pandemic status of a country, both for the CFR (each estimate uses about 30 days of data on cases and deaths) and for the vaccination status. Other aggregations are possible, such as daily (too much noise) or quarterly (fewer observations). This is just one of the many modelling decisions that I had to make for this analysis.
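
To make the monthly aggregation concrete, here is a minimal Python sketch of summing daily counts per country and month. It is not the R code behind this post; the record layout, the function name and the numbers are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily records: (country, day, new cases, new deaths).
daily = [
    ("Chile", date(2021, 1, 30), 400, 6),
    ("Chile", date(2021, 1, 31), 350, 4),
    ("Chile", date(2021, 2, 1), 300, 5),
]


def aggregate_monthly(records):
    """Sum daily cases and deaths per (country, year, month)."""
    totals = defaultdict(lambda: {"cases": 0, "deaths": 0})
    for country, day, cases, deaths in records:
        key = (country, day.year, day.month)
        totals[key]["cases"] += cases
        totals[key]["deaths"] += deaths
    return dict(totals)


monthly = aggregate_monthly(daily)
for key, counts in sorted(monthly.items()):
    print(key, counts)
```

Summing works for daily counts of new cases and deaths. A cumulative series such as the vaccination rate would need a different rule (for example, taking the last value in each month), which is an assumption on my part rather than something stated above.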

Because COVID-19 usually takes about two weeks to progress from a confirmed case to death, I shifted the death data back by two weeks so that deaths line up with the cases they most likely came from when estimating the monthly CFR for each country.
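
The two-week shift can be sketched in a few lines of Python. This is only an illustration under the assumption that the shift is a fixed 14 days applied to the reporting date of each death count; the names and numbers are hypothetical and it is not the R code used for the post.

```python
from datetime import date, timedelta

LAG = timedelta(days=14)  # assumed fixed delay between case report and death


def shift_deaths_back(daily_deaths, lag=LAG):
    """Re-date each death count to `lag` earlier, to line up with case dates."""
    return {day - lag: deaths for day, deaths in daily_deaths.items()}


# Hypothetical daily death counts keyed by reporting date.
deaths = {date(2021, 2, 5): 12, date(2021, 2, 20): 9}
aligned = shift_deaths_back(deaths)
print(aligned)
```

Note that the deaths reported on 5 February move to 22 January, so they count towards the January CFR instead of the February one.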

We also need to clean the data. This is a really important step, as data sets with this many observations always contain some odd values. I took a simple approach here and removed all observations with negative deaths or cases. To improve the robustness of the analysis, the CFR was estimated for each country only in the months with at least 100 cases and more than 10 COVID-19 deaths.
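
Here is a minimal Python sketch of these two cleaning rules, using the thresholds stated above (at least 100 cases, more than 10 deaths). The function names and the sample rows are hypothetical, and this is not the R code used for the post.

```python
def drop_negative_counts(daily_rows):
    """Keep only rows where both cases and deaths are non-negative."""
    return [row for row in daily_rows if row["cases"] >= 0 and row["deaths"] >= 0]


def cfr_is_estimable(monthly_cases, monthly_deaths):
    """Apply the thresholds: at least 100 cases and more than 10 deaths."""
    return monthly_cases >= 100 and monthly_deaths > 10


# Hypothetical daily rows for one country-month.
daily_rows = [
    {"cases": 120, "deaths": 7},
    {"cases": -15, "deaths": 0},  # negative value, e.g. a reporting correction
    {"cases": 90, "deaths": 5},
]

kept = drop_negative_counts(daily_rows)
monthly_cases = sum(row["cases"] for row in kept)
monthly_deaths = sum(row["deaths"] for row in kept)
print(monthly_cases, monthly_deaths, cfr_is_estimable(monthly_cases, monthly_deaths))
```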

As I mentioned above, I had to join both data sets based on the country name. As the country names sometimes differ in small details, I manually corrected them to increase the number of matches (for example, United States and United States of America were matched together). The resulting data frame for analysis contains data on 189 different countries, not bad at all! For these countries we have reliable information on COVID-19 cases and deaths, and on vaccination status, for a number of months that differs according to the start date of the vaccination campaign in each country.
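
The name correction and join can be sketched as follows in Python. This is only an illustration: the lookup table contains just the United States pair mentioned above (the real list of corrections is longer and lives in the R code on GitHub), and the function names and all values are hypothetical.

```python
# Illustrative lookup from one spelling to the other. Only this pair is
# mentioned in the post; any other entries would be assumptions.
NAME_FIXES = {"United States": "United States of America"}


def harmonize(name):
    """Trim whitespace and map known alternative spellings to one name."""
    cleaned = name.strip()
    return NAME_FIXES.get(cleaned, cleaned)


def inner_join(who_rows, owid_rows):
    """Merge rows on harmonized country name; also report unmatched names."""
    who_by_name = {harmonize(n): row for n, row in who_rows.items()}
    owid_by_name = {harmonize(n): row for n, row in owid_rows.items()}
    matched = who_by_name.keys() & owid_by_name.keys()
    merged = {n: {**who_by_name[n], **owid_by_name[n]} for n in sorted(matched)}
    unmatched = sorted(who_by_name.keys() ^ owid_by_name.keys())
    return merged, unmatched


# Hypothetical values, keyed by the country name used in each source.
who = {
    "United States of America": {"cfr": 1.1},
    "Chile": {"cfr": 1.4},
    "Saba": {"cfr": 0.9},
}
owid = {
    "United States": {"vaccinated_per_100": 67.0},
    "Chile": {"vaccinated_per_100": 91.0},
}

merged, unmatched = inner_join(who, owid)
print(sorted(merged))  # ['Chile', 'United States of America']
print(unmatched)       # ['Saba']
```

Listing the unmatched names is a convenient way to find which spellings still need a manual correction.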

4 Exploratory Data Analysis

We have finally finished all the data preparation; thanks for sticking with me to this point. Now comes the exploratory part, which will be much more fun and interactive!

Here are some summary statistics for the whole working data.

From this summary table we can already observe some interesting facts:

  • For the 189 countries considered, the average number of months with a CFR estimate is 12.43, ranging from countries with only 1 month of data to countries with 20 months of data.
  • The CFR has a mean value of 1.86%, but a really high variance (the variance is the square of the standard deviation, so here it is 8.53, measured in squared percentage points). The range of the metric is also very wide, with some countries showing CFR values above 75% in a given month.
  • We also observe a high dispersion in the vaccination level across countries, with some countries reaching more than 90 people fully vaccinated per 100 people, and others with no vaccination at all.
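
For readers who want to reproduce this kind of summary, here is a minimal Python sketch of the mean, standard deviation and variance. The values are hypothetical (they are not the data from this post) and the code is not the R code used for the analysis; it only shows that the variance is the square of the standard deviation and therefore lives in squared units.

```python
import statistics

# Hypothetical monthly CFR values in percent (not the post's data).
cfr_values = [0.8, 1.2, 1.9, 2.4, 6.0]

mean_cfr = statistics.mean(cfr_values)     # in percent
sd_cfr = statistics.stdev(cfr_values)      # sample standard deviation, in percent
var_cfr = statistics.variance(cfr_values)  # sample variance, in percent squared

print(f"mean = {mean_cfr:.2f}, sd = {sd_cfr:.2f}, variance = {var_cfr:.2f}")
```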

Now it is time to make some figures for each data set; hopefully they will better describe all the valuable information we have!

4.1 COVID-19 Data

We know that the COVID-19 pandemic has evolved over time, so it is worth taking a look at the time series of the COVID-19 data. Our main outcome of interest is the case fatality rate (CFR) for COVID-19. Let’s look at the total case fatality rate for COVID-19 over the entire pandemic period. Thanks to the plotly library we can make our figures interactive, so in every figure you can explore each individual point with your mouse pointer! Please take some time to play with the figures and look for new insights!

Remember that all the data was grouped into monthly units (from December 2020 to July 2022) for each country, so each point in the plot represents a month for a given country.

We can see that the case fatality rate is around 2% for the majority of the observations, which correspond to monthly rates for each country. We observe variation across countries and across WHO regions, with Europe and Western Pacific showing the lowest CFR.

Let’s look at the case fatality rate over time in a more classic time series plot (each color is a world region):

We observe a higher CFR in the early stages of the pandemic, and an important reduction in more recent months. We also observe important peaks and differences across WHO regions, probably due to different variants of the COVID-19 virus and different climatic seasons (northern and southern hemisphere). Overall we observe a reduction in recent months. Could it be attributed to the COVID-19 vaccines?

4.2 Vaccination Data

The next figure shows the vaccination status of each country at the end of July 2022. We observe important differences in access to vaccines across WHO regions, and between countries in the same region (each point in the figure represents a country). For example, Europe presents a high vaccination rate overall, but we observe important differences in the vaccination level among European countries. The distribution of vaccination levels seems to be bimodal, with some countries at high vaccination levels and some at really low ones.

Again, the interactivity of the figure makes it easier to explore the situation of each country!

4.2.1 CFR vs Vaccination Rates

Let’s make a comparison between the percentage of vaccinated people and the monthly CFR for each country. This comparison directly relates to our question of interest: do vaccines reduce the CFR? The following boxplot summarizes the case fatality rate over different levels of vaccination in each month for each country. The analysis is divided by world region:

We observe for each region a clear reduction in the CFR for COVID-19 associated with higher vaccination levels (to see this, simply compare the purple low-vaccination CFR with the red high-vaccination CFR). This reduction occurs both in the average value and in the variation across countries in each month. This is interesting, as higher vaccination is associated with a lower variance of the CFR as well: every country with a high vaccination rate is at the low end of the CFR spectrum. Point to the vaccines!

We do observe that sometimes the middle vaccination categories (blue and green) overlap each other, and in some regions the CFR reduction is not clear. This could happen for several reasons, such as border effects in the classification (the numeric vaccination variable was discretized into bins), noise in the data, or higher efficacy of vaccines above a certain threshold. Nonetheless, the figure is a great summary of information: it shows both the efficacy of vaccines and the vaccination state per world region. We can observe that some regions are left with low vaccination levels, like Africa. The same is true for some countries inside each region.
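
To illustrate the discretization and its border effects, here is a minimal Python sketch that assigns each observation to a vaccination bin and computes the median CFR per bin. The bin edges, labels and observations are all hypothetical (the edges used for the figure are not stated here), and this is not the R code used for the post.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median

# Hypothetical bin edges (percent fully vaccinated) and matching labels.
EDGES = [25.0, 50.0, 75.0]
LABELS = ["[0, 25)", "[25, 50)", "[50, 75)", "[75, 100]"]


def vaccination_bin(pct_fully_vaccinated):
    """Return the label of the bin that contains the given percentage."""
    return LABELS[bisect_right(EDGES, pct_fully_vaccinated)]


# Hypothetical (percent fully vaccinated, monthly CFR in percent) pairs.
observations = [(10.0, 2.5), (49.9, 1.8), (50.1, 1.7), (80.0, 0.6), (90.0, 0.4)]

cfr_by_bin = defaultdict(list)
for vaccinated, cfr in observations:
    cfr_by_bin[vaccination_bin(vaccinated)].append(cfr)

median_cfr = {label: median(values) for label, values in cfr_by_bin.items()}
print(median_cfr)
```

The observations at 49.9 and 50.1 are almost identical but land in different bins, which is the kind of border effect that can make neighbouring categories overlap.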

This figure seems to support our main hypothesis of reduction of CFR in countries with higher vaccination rates. In the next post we will construct a statistical model to test this hypothesis formally. For now, I think this great figure is a nice way to end this post with some supporting evidence that vaccines do work and reduce the CFR for COVID-19!

Please read the second part of this post for the formal statistical analysis to test our hypothesis. We will use all the data gathered in this post to create a formal model and draw more robust inferences on the effect of vaccines!

Acknowledgement

This post is based on my personal project for the course “STA207: Statistical Methods for Research II”, part of my M.Sc. in Statistics and Data Science. For all the valuable comments and feedback I received during my project, I thank the instructor, Professor Shizhe Chen, and acknowledge his materials on statistical methods for research. I also thank the following classmates for their valuable feedback and comments during the project development: Yinan Cheng, Shuyu Guo, Kyung Jin Lee, Katherine Cheng, Oscar Rivera and Jedidiah Harwood. Finally, I thank my friend Alonso Perez for all his valuable comments on this post!

References

  • Course Notes STA207 UC Davis Winter Quarter 2022. Professor Shizhe Chen.
  • World Health Organization (WHO). (2022). WHO Coronavirus (COVID-19) Data. See: https://covid19.who.int/info
  • Our World in Data. (2022). Coronavirus (COVID-19) Vaccinations Data. https://ourworldindata.org/covid-vaccinations
  • Liang, L. L., Kuo, H. S., Ho, H. J., & Wu, C. Y. (2021). COVID-19 vaccinations are associated with reduced fatality rates: Evidence from cross-county quasi-experiments. Journal of Global Health, 11.
  • Passarelli-Araujo, H., Pott-Junior, H., Susuki, A. M., Olak, A. S., Pescim, R. R., Tomimatsu, M. F., … & Urbano, M. R. (2022). The impact of COVID-19 vaccination on case fatality rates in a city in Southern Brazil. American Journal of Infection Control.
  • Haldar, A., & Sethi, N. (2020). The effect of country-level factors and government intervention on the incidence of COVID-19. Asian Economics Letters, 1(2), 17804.
  • Hartig, F. (2022). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.4.5. http://florianhartig.github.io/DHARMa/
  • Polack, F. P., Thomas, S. J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., … & Gruber, W. C. (2020). Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. New England Journal of Medicine.